11. Hyperparameter Tuning in Review
Machine learning models use data to fit their internal parameters. However, all models also have settings that configure how they work and aren't modified during training; these are called hyperparameters. We can tune hyperparameters by building many different models, each with different hyperparameter values, and evaluating each model's performance. Just like we can overfit a model's parameters, we can also overfit its hyperparameters. To avoid this, we estimate performance using nested cross-validation.
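To make that distinction concrete, here is a minimal sketch (the data is a synthetic stand-in, not the lesson's activity dataset): hyperparameters are fixed when the model is constructed, while parameters are fit from the data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the lesson's real features live in
# /activity-classifier/data/.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Hyperparameters: chosen by us before training, passed to the constructor.
clf = RandomForestClassifier(n_estimators=50, max_depth=2)

# Parameters: the individual tree splits are learned from the data here.
clf.fit(X, y)
```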
Some hyperparameters control how easily the model can overfit the data, often by restricting its complexity. In our case, this hyperparameter was the depth of the trees in the forest. When we limited the tree depth to just 2, we saw the cross-validation error decrease substantially.
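A hedged sketch of comparing tree depths with cross-validation is below; because the data here is synthetic, the scores will not reproduce the lesson's numbers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 10)           # synthetic stand-in features
y = np.random.randint(0, 4, size=200) # synthetic stand-in activity labels

# Compare an unconstrained forest against one limited to depth 2.
for depth in (None, 2):
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```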
Another hyperparameter we can modify is the number of features we include in the model. Random forest models rely on some features more than others to classify the data. In sklearn, we can ask the RandomForestClassifier which features were most important. By building a new model that uses only the 10 best features, we were able to improve our performance to 93%.
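A minimal sketch of this feature-selection step, assuming synthetic data and a hypothetical pool of 30 candidate features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 30)           # 30 hypothetical features
y = np.random.randint(0, 4, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# feature_importances_ scores how much each feature contributed to the splits.
top10 = np.argsort(clf.feature_importances_)[::-1][:10]

# Retrain using only the 10 most important features.
clf_top = RandomForestClassifier(n_estimators=100, random_state=0)
clf_top.fit(X[:, top10], y)
```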
Q: When should you use nested CV?
SOLUTION:
None of the above
Notebook Review
If you want to interact with the notebook from the video, you can access it here in the repo at /activity-classifier/walkthroughs/hyperparameter-tuning/ or in the workspace below.
The dataset used throughout this lesson can be found at the top of the lesson directory at /activity-classifier/data/.
Code
If you need the code, it can be found on GitHub at https://github.com/udacity.
Further Resources
Nested cross-validation can be a tricky concept to wrap your head around. Here are explanations from several different authors. Maybe one of the following resources will explain it in a way that clicks for you:
- Our code implementing nested CV was fairly verbose so that you could see all of the steps. As with almost everything in ML, sklearn can do it for us as well, and you can learn more about nested CV in sklearn through the documentation. A minimal sketch is shown after this list.
- Is overfitting our hyperparameters really a problem in practice? Yes (or so says this 2010 paper).
- An explanation of the difference between hyperparameters and regular parameters in this article from Machine Learning Mastery.
- If you want to learn more about regularization, check out this article from Towards Data Science.
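For reference, here is a compact sketch of nested CV with sklearn; the data and the max_depth grid are hypothetical stand-ins, not the lesson's setup. The inner GridSearchCV tunes the hyperparameters, while the outer cross_val_score estimates performance on folds the tuning never saw:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X = np.random.rand(200, 10)           # synthetic stand-in data
y = np.random.randint(0, 4, size=200)

# Inner loop: GridSearchCV picks the best max_depth on each training split.
inner = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    cv=3,
)

# Outer loop: evaluates the tuned model on held-out folds, giving a
# performance estimate that isn't biased by the tuning itself.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f}")
```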
Glossary
- Hyperparameter: A parameter of the model that dictates how the model learns. It is not fit during the model's training process.
- Regularization: A technique to reduce overfitting of a model by discouraging complexity in the model.
- Nested cross-validation: A technique to determine model performance when hyperparameters are also optimized.
Exercise 3: A Quirk in the Dataset
Instructions
- Complete the Offline or Online instructions below.
- Read through the whole .ipynb.
- Complete all the code cells that contain ## Your Code Goes Here.
Offline
- In the repo, which you can access here at /activity-classifier/exercises/3-quirk-in-the-dataset/, you should find the following file: 3_quirk_in_the_dataset.ipynb
- The dataset used throughout this lesson can be found at the top of the lesson directory at /activity-classifier/data/.
- Open up the Python notebook and associated files in your desired editor.
Note: Instructions for how to set up your local environment can be found in Introduction to Wearable Data's Developer Workflow concept.
Online
- Go to the next concept; 3_quirk_in_the_dataset.ipynb should be open and the workspace should already contain the appropriate data folder.